Analysis of Life Expectancy at Age 60

Group Name: wallaby

Group members: Miaoyang Kong, Jiaqi Guo, Yutong Wei

Introduction

We became interested in life expectancy after reading a New York Times article, "U.S. Life Expectancy Falls Again in ‘Historic’ Setback." In recent years, concern about the length of the human life span has grown, and many countries now have aging populations. Our research focuses on the effect of six explanatory variables on life expectancy.

Main Question

Which variables have a significant effect on life expectancy at age 60?

Data Section

Source Data

The life expectancy data are collected by the World Health Organization. Since life expectancy at age 60 is reported for the years 2000, 2010, 2015, and 2019, we concentrate on these four years. Life expectancy at age 60 reflects the overall mortality level of the population aged 60 and over, and it summarizes the mortality pattern that prevails across all age groups above 60. We chose life expectancy at age 60 because we wish to rule out the impact of newborn mortality on life expectancy. We therefore focus on mortality from various diseases and investigate which disease has the greater impact on life expectancy. We are also interested in how factors such as income and education affect life expectancy.

As predictors, we choose the number of deaths caused by tuberculosis, the number of deaths caused by noncommunicable diseases (NCD), and the prevalence of undernourishment to examine how specific diseases or unhealthy conditions affect life expectancy; the suicide rate to examine the effect of suicide; the tertiary school enrollment rate to examine the effect of education level; and per adult national income to examine the effect of economic level. As with the response, we select observations of the predictors from the years 2000, 2010, 2015, and 2019.

Key Variables from the Life Expectancy Data
Name                        Description
life_expectancy (response)  Life expectancy at age 60 (years)
tuberculousis               Estimated number of deaths due to tuberculosis, excluding HIV
NCD                         Number of deaths attributed to noncommunicable diseases (in thousands)
income                      Per adult national income
suicide                     Age-standardized suicide mortality rate (per 100,000 population)
education                   School enrollment, tertiary (% gross)
undernourishment            Prevalence of undernourishment (% of population)

Because data on undernourishment and education are missing for certain countries, we omit the observations that contain missing values.

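library(tidyverse)  # provides read_csv() and the %>% pipe

# Read the merged data, treat negative values as missing, and drop incomplete rows
total <- read_csv("merge.csv")
total[total < 0] <- NA
total <- total %>% na.omit()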

total %>%
  head(10)
## # A tibble: 10 × 9
##    Country    Year Life expect…¹ Numbe…² Age-s…³ Total…⁴ Per a…⁵ Schoo…⁶ Preva…⁷
##    <chr>     <dbl>         <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Albania    2000          19        23    5.23    656.  21554.   15.5      4.9
##  2 Albania    2010          21.3       9    7.63    506.  37419.   44.5      5.8
##  3 Albania    2015          21.1       8    4.23    514.  40068.   62.0      4.9
##  4 Albania    2019          21.0       8    3.72    602.  43732.   59.8      4.3
##  5 Algeria    2010          21.4    3000    3       488.  26267.   29.9      4.3
##  6 Algeria    2015          21.8    3200    2.72    461.  28058.   36.8      2.8
##  7 Algeria    2019          22.0    2800    2.6     446.  25059.   52.6      2.5
##  8 Angola     2015          16.7   16000   13.3     639.  13326.    8.40    14.5
##  9 Argentina  2000          20.2     890    9.2     540.  17885.   54.0      3  
## 10 Argentina  2010          20.6     580    8.43    491.  37678.   73.2      3.1
## # … with abbreviated variable names ¹​`Life expectancy`,
## #   ²​`Number of death due to tuberculosis, excluding HIV`,
## #   ³​`Age-standarized suicide rates (per 100000 population)`,
## #   ⁴​`Total NCD Death(inthousands)`, ⁵​`Per adult national income`,
## #   ⁶​`School enrollment, tertiary (% gross)`,
## #   ⁷​`Prevalence of undernourishment (% of population)`
print(total)
## # A tibble: 409 × 9
##    Country    Year Life expect…¹ Numbe…² Age-s…³ Total…⁴ Per a…⁵ Schoo…⁶ Preva…⁷
##    <chr>     <dbl>         <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Albania    2000          19        23    5.23    656.  21554.   15.5      4.9
##  2 Albania    2010          21.3       9    7.63    506.  37419.   44.5      5.8
##  3 Albania    2015          21.1       8    4.23    514.  40068.   62.0      4.9
##  4 Albania    2019          21.0       8    3.72    602.  43732.   59.8      4.3
##  5 Algeria    2010          21.4    3000    3       488.  26267.   29.9      4.3
##  6 Algeria    2015          21.8    3200    2.72    461.  28058.   36.8      2.8
##  7 Algeria    2019          22.0    2800    2.6     446.  25059.   52.6      2.5
##  8 Angola     2015          16.7   16000   13.3     639.  13326.    8.40    14.5
##  9 Argentina  2000          20.2     890    9.2     540.  17885.   54.0      3  
## 10 Argentina  2010          20.6     580    8.43    491.  37678.   73.2      3.1
## # … with 399 more rows, and abbreviated variable names ¹​`Life expectancy`,
## #   ²​`Number of death due to tuberculosis, excluding HIV`,
## #   ³​`Age-standarized suicide rates (per 100000 population)`,
## #   ⁴​`Total NCD Death(inthousands)`, ⁵​`Per adult national income`,
## #   ⁶​`School enrollment, tertiary (% gross)`,
## #   ⁷​`Prevalence of undernourishment (% of population)`
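
The cleaning step above can be illustrated on a small made-up data frame. This is only a sketch: the data frame `toy`, its column names, and its values are hypothetical and are not taken from `merge.csv`. It replaces negative numeric entries with `NA` and then drops every incomplete row, which is the same idea as the two cleaning lines applied to `total`.

```r
# Toy data frame (made-up values); -1 stands in for a missing measurement.
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  income  = c(21000, -1, 35000, 18000),
  school  = c(15.5, 44.5, NA, 62.0)
)

# Replace negative entries in the numeric columns with NA.
num_cols <- sapply(toy, is.numeric)
toy[num_cols] <- lapply(toy[num_cols], function(x) replace(x, !is.na(x) & x < 0, NA))

# Drop every row that still has a missing value.
toy_clean <- na.omit(toy)
print(toy_clean)
```

Only the rows for countries A and D survive, since B has a negative income and C has a missing enrollment value.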

After observing non-linearity in the original exploratory plot, we decided to apply a log transformation to `Number of death due to tuberculosis, excluding HIV`, `Total NCD Death(inthousands)`, `Per adult national income`, `Age-standarized suicide rates (per 100000 population)`, and `Prevalence of undernourishment (% of population)`. Additionally, we applied a square root transformation to `School enrollment, tertiary (% gross)`.
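
As a rough illustration of why a log transformation can help, the sketch below uses synthetic data (not the report's data) in which the response is linear in the logarithm of a strongly skewed predictor. The variable names `x` and `y` and all numeric settings are assumptions made for the example.

```r
set.seed(1)  # synthetic data, for illustration only

# Hypothetical predictor spanning several orders of magnitude
x <- exp(runif(200, min = 0, max = 10))
# Response that is linear in log(x), plus noise
y <- 20 - 0.5 * log(x) + rnorm(200, sd = 0.3)

cor_raw <- cor(x, y)       # linear correlation on the original scale
cor_log <- cor(log(x), y)  # linear correlation after the log transform

cat("raw:", round(cor_raw, 3), " log:", round(cor_log, 3), "\n")
```

In this setup the linear association is much stronger on the log scale, which is the pattern a scatter plot would show as a curve straightening out.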

This is the exploratory plot of our dataset. We observe a positive relationship between life expectancy and income, and between life expectancy and education. In addition, we observe a negative relationship between life expectancy and NCD.

# Log-transform all numeric columns, then restore the two columns that should not be logged
total_transform <- log(total[-(1:2)])
total_transform$`Life expectancy` <- total$`Life expectancy`
total_transform$`School enrollment, tertiary (% gross)` <- sqrt(total$`School enrollment, tertiary (% gross)`)
pairs(total_transform)

We produce a matrix of scatter plots to visualize the correlation between variables. The scatter plot of each pair appears on the right side of the matrix, and the Pearson correlation value and its significance appear on the left side.
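
For a single pair of variables, the same two quantities (the Pearson correlation and its significance) can be obtained with base R's `cor.test()`. The sketch below uses synthetic vectors `a` and `b`, which are hypothetical and unrelated to the report's data.

```r
set.seed(2)  # synthetic data, for illustration only
a <- rnorm(100)
b <- 0.8 * a + rnorm(100, sd = 0.5)

ct <- cor.test(a, b, method = "pearson")
print(ct$estimate)  # Pearson correlation coefficient
print(ct$p.value)   # p-value for the null hypothesis of zero correlation
```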

library(GGally)

ggpairs(total_transform[is.finite(rowSums(total_transform)),], lower = list(continuous = "cor", combo = "box_no_facet", discrete = "facetbar",na="na"),
    upper = list(continuous = wrap("smooth", alpha = 0.3, size=0.2)))

Method Section

We fit a multiple linear regression model and include interaction effects. We also check variance inflation factor (VIF) values, which flag multicollinearity among predictors, to help us identify the variables that have a significant effect on life expectancy.
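
As a minimal sketch of what a VIF measures, the code below computes it by hand on synthetic predictors: each predictor is regressed on the others, and the VIF is 1 / (1 - R^2) from that regression. The predictors `x1`, `x2`, and `x3` are made up for the example; in practice a packaged function such as `vif()` from the car package could be used instead.

```r
set.seed(3)  # synthetic predictors, for illustration only
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # deliberately correlated with x1
x3 <- rnorm(n)                       # unrelated to the others
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

The correlated pair `x1` and `x2` receives much larger values than the independent `x3`, whose VIF stays close to 1.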

Multiple Linear Regression Model

Since we want to find the relationship between life expectancy (the dependent variable, or response) and the six factors we are interested in (the independent variables, or predictors), namely tuberculousis, NCD, income, suicide, education, and undernourishment, we select the multiple linear regression model.

#Define the response and predictors:
life_expectancy =total$`Life expectancy`
tuberculousis =total$`Number of death due to tuberculosis, excluding HIV`
NCD = total$`Total NCD Death(inthousands)`
income = total$`Per adult national income`
suicide = total$`Age-standarized suicide rates (per 100000 population)`
education=total$`School enrollment, tertiary (% gross)`
undernourishment = total$`Prevalence of undernourishment (% of population)`

We transform the independent variables with logarithms (base 10 for tuberculosis deaths, natural log for the others) and a square root for education to better meet the linearity assumption.

log_tuberculousis=log10(tuberculousis)
log_NCD = log(NCD)
log_income=log(income)
log_suicide=log(suicide)
sqrt_education=sqrt(education)
log_undernourishment=log(undernourishment)

To hold the transformed independent variables, we create a new data frame. One observation becomes undefined after the log transformation, because the logarithm of a non-positive value is not a finite number, so we mark it as missing and remove it from the new data frame.
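
The behavior of `log()` on non-positive values can be checked directly. The vector `v` below is made up for the example.

```r
v <- c(10, 0, -1)  # made-up values

# log(0) is -Inf; the log of a negative number is NaN (R also issues a warning).
lv <- suppressWarnings(log(v))
print(lv)

# is.finite() is FALSE for -Inf, Inf, NaN, and NA, so it flags both cases.
keep <- is.finite(lv)
print(v[keep])
```

Filtering with `is.finite()` is one way to catch both kinds of undefined result in a single step.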

new_data <- data.frame(life_expectancy, log_tuberculousis, log_NCD, log_income, log_suicide, sqrt_education, log_undernourishment)
# Mark missing and -Inf entries as NA, then drop those rows
new_data[is.na(new_data) | new_data == -Inf] <- NA
new_data <- new_data %>% na.omit()
new_data %>%
  head(10)
##    life_expectancy log_tuberculousis  log_NCD log_income log_suicide
## 1            19.00         1.3617278 6.486008   9.978319   1.6544113
## 2            21.31         0.9542425 6.226339  10.529923   2.0320878
## 3            21.13         0.9030900 6.242029  10.598345   1.4422020
## 4            21.03         0.9030900 6.400091  10.685847   1.3137237
## 5            21.37         3.4771213 6.189905  10.176057   1.0986123
## 6            21.81         3.5051500 6.132747  10.242027   1.0006319
## 7            22.04         3.4471580 6.099870  10.129001   0.9555114
## 8            16.71         4.2041200 6.460061   9.497503   2.5855058
## 9            20.18         2.9493900 6.291939   9.791727   2.2192035
## 10           20.62         2.7634280 6.197055  10.536835   2.1317968
##    sqrt_education log_undernourishment
## 1        3.941651            1.5892352
## 2        6.674523            1.7578579
## 3        7.874492            1.5892352
## 4        7.731656            1.4586150
## 5        5.467123            1.4586150
## 6        6.064760            1.0296194
## 7        7.253960            0.9162907
## 8        2.898437            2.6741486
## 9        7.346074            1.0986123
## 10       8.557535            1.1314021
lmmodel = lm(life_expectancy ~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment, data = new_data)
summary(lmmodel)
## 
## Call:
## lm(formula = life_expectancy ~ log_tuberculousis + log_NCD + 
##     log_income + log_suicide + sqrt_education + log_undernourishment, 
##     data = new_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4601 -0.3456 -0.0102  0.4256  1.6335 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          62.28159    0.88050  70.734  < 2e-16 ***
## log_tuberculousis    -0.13508    0.02833  -4.768 2.60e-06 ***
## log_NCD              -6.77359    0.11860 -57.113  < 2e-16 ***
## log_income            0.03669    0.03015   1.217    0.224    
## log_suicide          -0.26235    0.04385  -5.983 4.87e-09 ***
## sqrt_education        0.22116    0.02013  10.985  < 2e-16 ***
## log_undernourishment -0.35890    0.05465  -6.567 1.59e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5667 on 401 degrees of freedom
## Multiple R-squared:  0.9639, Adjusted R-squared:  0.9634 
## F-statistic:  1784 on 6 and 401 DF,  p-value: < 2.2e-16
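
Written out, the model fitted above has the form below. The index i (one country-year observation), the coefficients β_0 to β_6, and the error term ε_i are notation introduced here for illustration; the logarithm bases follow the transformation code, where tuberculosis uses `log10()` and the other logged predictors use `log()`.

```latex
\begin{aligned}
\text{life\_expectancy}_i ={}& \beta_0
  + \beta_1 \log_{10}(\text{tuberculousis}_i)
  + \beta_2 \ln(\text{NCD}_i)
  + \beta_3 \ln(\text{income}_i) \\
 &+ \beta_4 \ln(\text{suicide}_i)
  + \beta_5 \sqrt{\text{education}_i}
  + \beta_6 \ln(\text{undernourishment}_i)
  + \varepsilon_i
\end{aligned}
```

Because NCD enters in natural-log form, its estimate can be read on a percentage scale: a 1% increase in NCD deaths is associated with a change of about -6.774 × ln(1.01) ≈ -6.774 × 0.00995 ≈ -0.067 years in life expectancy at age 60, holding the other predictors fixed.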

Interaction Effect

To detect interaction effects between pairs of the variables we are interested in, we use a stepwise algorithm based on AIC. This automated procedure adds or drops one term at a time and returns the model with the lowest AIC it reaches.
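
The AIC that `stepAIC()` compares for linear models comes from `extractAIC()`, which for an `lm` fit equals n * log(RSS / n) + 2 * p, where p is the number of estimated coefficients. It differs from the value returned by `AIC()` by an additive constant. The sketch below checks this on synthetic data; the data frame `d` and its columns are hypothetical.

```r
set.seed(4)  # synthetic data, for illustration only
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)
n   <- nrow(d)
rss <- sum(residuals(fit)^2)
p   <- length(coef(fit))  # number of estimated coefficients

aic_by_hand <- n * log(rss / n) + 2 * p
aic_extract <- extractAIC(fit)[2]  # the criterion stepAIC() compares
print(c(by_hand = aic_by_hand, extractAIC = aic_extract))
```

Applied to the intercept-only model in this report, RSS = 3566.0 with n = 408 observations and one coefficient gives 408 × log(3566.0 / 408) + 2 ≈ 886.5, which agrees with the starting AIC in the trace.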

library(MASS)
lm1 = lm(life_expectancy ~1, data = new_data)
lm2 = lm(life_expectancy ~ (.)^2, data = new_data)
lm.both = stepAIC(lm1, direction="both", scope=list(upper=lm2,lower=lm1))
## Start:  AIC=886.51
## life_expectancy ~ 1
## 
##                        Df Sum of Sq    RSS    AIC
## + log_NCD               1    3231.6  334.4 -77.22
## + sqrt_education        1    1965.2 1600.7 561.72
## + log_undernourishment  1    1597.5 1968.4 646.08
## + log_income            1    1508.7 2057.2 664.08
## + log_tuberculousis     1     926.6 2639.3 765.74
## + log_suicide           1     306.4 3259.6 851.86
## <none>                              3566.0 886.51
## 
## Step:  AIC=-77.22
## life_expectancy ~ log_NCD
## 
##                        Df Sum of Sq    RSS     AIC
## + sqrt_education        1     163.6  170.7 -349.51
## + log_undernourishment  1     139.4  195.0 -295.25
## + log_income            1      80.2  254.1 -187.17
## + log_tuberculousis     1      45.5  288.8 -134.96
## + log_suicide           1       3.3  331.0  -79.32
## <none>                               334.4  -77.22
## - log_NCD               1    3231.6 3566.0  886.51
## 
## Step:  AIC=-349.51
## life_expectancy ~ log_NCD + sqrt_education
## 
##                          Df Sum of Sq     RSS     AIC
## + log_NCD:sqrt_education  1     22.12  148.58 -404.14
## + log_undernourishment    1     20.59  150.11 -399.97
## + log_tuberculousis       1     14.87  155.83 -384.71
## + log_suicide             1     10.68  160.02 -373.87
## + log_income              1      5.81  164.89 -361.63
## <none>                                 170.70 -349.51
## - sqrt_education          1    163.65  334.35  -77.22
## - log_NCD                 1   1430.05 1600.75  561.72
## 
## Step:  AIC=-404.14
## life_expectancy ~ log_NCD + sqrt_education + log_NCD:sqrt_education
## 
##                          Df Sum of Sq    RSS     AIC
## + log_undernourishment    1   15.3192 133.26 -446.53
## + log_tuberculousis       1   13.2115 135.37 -440.13
## + log_income              1    6.6980 141.88 -420.96
## + log_suicide             1    4.0796 144.50 -413.49
## <none>                                148.58 -404.14
## - log_NCD:sqrt_education  1   22.1201 170.70 -349.51
## 
## Step:  AIC=-446.53
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_NCD:sqrt_education
## 
##                                       Df Sum of Sq    RSS     AIC
## + sqrt_education:log_undernourishment  1   20.2196 113.04 -511.67
## + log_tuberculousis                    1    9.0503 124.21 -473.23
## + log_suicide                          1    6.2158 127.05 -464.02
## + log_income                           1    1.5270 131.74 -449.23
## + log_NCD:log_undernourishment         1    0.9989 132.26 -447.60
## <none>                                             133.26 -446.53
## - log_undernourishment                 1   15.3192 148.58 -404.14
## - log_NCD:sqrt_education               1   16.8451 150.11 -399.97
## 
## Step:  AIC=-511.67
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_NCD:sqrt_education + sqrt_education:log_undernourishment
## 
##                                       Df Sum of Sq    RSS     AIC
## + log_tuberculousis                    1    7.5627 105.48 -537.92
## + log_income                           1    1.5318 111.51 -515.24
## + log_suicide                          1    1.4366 111.61 -514.89
## - log_NCD:sqrt_education               1    0.5046 113.55 -511.85
## <none>                                             113.04 -511.67
## + log_NCD:log_undernourishment         1    0.0001 113.04 -509.67
## - sqrt_education:log_undernourishment  1   20.2196 133.26 -446.53
## 
## Step:  AIC=-537.92
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_tuberculousis + log_NCD:sqrt_education + sqrt_education:log_undernourishment
## 
##                                          Df Sum of Sq    RSS     AIC
## + log_tuberculousis:sqrt_education        1    4.3464 101.13 -553.09
## + log_tuberculousis:log_undernourishment  1    2.0098 103.47 -543.77
## + log_suicide                             1    1.0408 104.44 -539.97
## + log_income                              1    0.8034 104.68 -539.04
## <none>                                                105.48 -537.92
## - log_NCD:sqrt_education                  1    0.5639 106.04 -537.75
## + log_NCD:log_undernourishment            1    0.2747 105.20 -536.99
## + log_tuberculousis:log_NCD               1    0.2617 105.22 -536.93
## - log_tuberculousis                       1    7.5627 113.04 -511.67
## - sqrt_education:log_undernourishment     1   18.7319 124.21 -473.23
## 
## Step:  AIC=-553.09
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_tuberculousis + log_NCD:sqrt_education + sqrt_education:log_undernourishment + 
##     sqrt_education:log_tuberculousis
## 
##                                          Df Sum of Sq     RSS     AIC
## + log_suicide                             1    1.2563  99.877 -556.19
## - log_NCD:sqrt_education                  1    0.1743 101.308 -554.39
## + log_income                              1    0.7838 100.350 -554.26
## <none>                                                101.133 -553.09
## + log_NCD:log_undernourishment            1    0.2608 100.873 -552.14
## + log_tuberculousis:log_NCD               1    0.2469 100.887 -552.09
## + log_tuberculousis:log_undernourishment  1    0.0183 101.115 -551.16
## - sqrt_education:log_tuberculousis        1    4.3464 105.480 -537.92
## - sqrt_education:log_undernourishment     1   12.9780 114.111 -505.83
## 
## Step:  AIC=-556.19
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_tuberculousis + log_suicide + log_NCD:sqrt_education + 
##     sqrt_education:log_undernourishment + sqrt_education:log_tuberculousis
## 
##                                          Df Sum of Sq     RSS     AIC
## + log_suicide:log_undernourishment        1    2.1311  97.746 -562.99
## + log_suicide:sqrt_education              1    1.7993  98.078 -561.61
## + log_tuberculousis:log_suicide           1    0.8870  98.990 -557.83
## - log_NCD:sqrt_education                  1    0.1005  99.978 -557.78
## + log_income                              1    0.8677  99.009 -557.75
## <none>                                                 99.877 -556.19
## + log_tuberculousis:log_NCD               1    0.4314  99.446 -555.96
## + log_NCD:log_undernourishment            1    0.1147  99.762 -554.66
## + log_NCD:log_suicide                     1    0.0192  99.858 -554.27
## + log_tuberculousis:log_undernourishment  1    0.0037  99.873 -554.20
## - log_suicide                             1    1.2563 101.133 -553.09
## - sqrt_education:log_tuberculousis        1    4.5618 104.439 -539.97
## - sqrt_education:log_undernourishment     1    9.6341 109.511 -520.62
## 
## Step:  AIC=-562.99
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_tuberculousis + log_suicide + log_NCD:sqrt_education + 
##     sqrt_education:log_undernourishment + sqrt_education:log_tuberculousis + 
##     log_undernourishment:log_suicide
## 
##                                          Df Sum of Sq     RSS     AIC
## + log_NCD:log_suicide                     1    1.1485  96.597 -565.81
## - log_NCD:sqrt_education                  1    0.0139  97.760 -564.93
## + log_tuberculousis:log_NCD               1    0.6738  97.072 -563.81
## + log_income                              1    0.5306  97.215 -563.21
## + log_tuberculousis:log_suicide           1    0.4945  97.252 -563.06
## <none>                                                 97.746 -562.99
## + log_suicide:sqrt_education              1    0.1646  97.581 -561.68
## + log_NCD:log_undernourishment            1    0.0579  97.688 -561.23
## + log_tuberculousis:log_undernourishment  1    0.0294  97.717 -561.11
## - log_undernourishment:log_suicide        1    2.1311  99.877 -556.19
## - sqrt_education:log_tuberculousis        1    4.2920 102.038 -547.46
## - sqrt_education:log_undernourishment     1    9.1677 106.914 -528.41
## 
## Step:  AIC=-565.81
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_tuberculousis + log_suicide + log_NCD:sqrt_education + 
##     sqrt_education:log_undernourishment + sqrt_education:log_tuberculousis + 
##     log_undernourishment:log_suicide + log_NCD:log_suicide
## 
##                                          Df Sum of Sq     RSS     AIC
## + log_tuberculousis:log_suicide           1    1.1502  95.447 -568.70
## - log_NCD:sqrt_education                  1    0.0173  96.615 -567.74
## + log_income                              1    0.6548  95.943 -566.59
## + log_suicide:sqrt_education              1    0.5846  96.013 -566.29
## <none>                                                 96.597 -565.81
## + log_tuberculousis:log_NCD               1    0.3704  96.227 -565.38
## + log_NCD:log_undernourishment            1    0.0569  96.541 -564.05
## + log_tuberculousis:log_undernourishment  1    0.0485  96.549 -564.02
## - log_NCD:log_suicide                     1    1.1485  97.746 -562.99
## - log_undernourishment:log_suicide        1    3.2604  99.858 -554.27
## - sqrt_education:log_tuberculousis        1    3.9694 100.567 -551.38
## - sqrt_education:log_undernourishment     1    8.3386 104.936 -534.03
## 
## Step:  AIC=-568.7
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_tuberculousis + log_suicide + log_NCD:sqrt_education + 
##     sqrt_education:log_undernourishment + sqrt_education:log_tuberculousis + 
##     log_undernourishment:log_suicide + log_NCD:log_suicide + 
##     log_tuberculousis:log_suicide
## 
##                                          Df Sum of Sq     RSS     AIC
## - log_NCD:sqrt_education                  1    0.0133  95.461 -570.64
## + log_tuberculousis:log_NCD               1    0.6026  94.845 -569.28
## + log_income                              1    0.5942  94.853 -569.25
## + log_suicide:sqrt_education              1    0.5804  94.867 -569.19
## <none>                                                 95.447 -568.70
## + log_NCD:log_undernourishment            1    0.0302  95.417 -566.83
## + log_tuberculousis:log_undernourishment  1    0.0002  95.447 -566.70
## - log_tuberculousis:log_suicide           1    1.1502  96.597 -565.81
## - log_NCD:log_suicide                     1    1.8043  97.252 -563.06
## - log_undernourishment:log_suicide        1    3.3048  98.752 -556.81
## - sqrt_education:log_tuberculousis        1    3.6598  99.107 -555.35
## - sqrt_education:log_undernourishment     1    7.9569 103.404 -538.03
## 
## Step:  AIC=-570.64
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_tuberculousis + log_suicide + sqrt_education:log_undernourishment + 
##     sqrt_education:log_tuberculousis + log_undernourishment:log_suicide + 
##     log_NCD:log_suicide + log_tuberculousis:log_suicide
## 
##                                          Df Sum of Sq     RSS     AIC
## + log_income                              1    0.6066  94.854 -571.24
## + log_tuberculousis:log_NCD               1    0.6030  94.858 -571.23
## + log_suicide:sqrt_education              1    0.5137  94.947 -570.84
## <none>                                                 95.461 -570.64
## + log_NCD:log_undernourishment            1    0.0432  95.417 -568.83
## + log_NCD:sqrt_education                  1    0.0133  95.447 -568.70
## + log_tuberculousis:log_undernourishment  1    0.0000  95.461 -568.64
## - log_tuberculousis:log_suicide           1    1.1543  96.615 -567.74
## - log_NCD:log_suicide                     1    1.8020  97.262 -565.01
## - log_undernourishment:log_suicide        1    3.4165  98.877 -558.30
## - sqrt_education:log_tuberculousis        1    3.6699  99.130 -557.25
## - sqrt_education:log_undernourishment     1    9.3535 104.814 -534.50
## 
## Step:  AIC=-571.24
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_tuberculousis + log_suicide + log_income + sqrt_education:log_undernourishment + 
##     sqrt_education:log_tuberculousis + log_undernourishment:log_suicide + 
##     log_NCD:log_suicide + log_tuberculousis:log_suicide
## 
##                                          Df Sum of Sq     RSS     AIC
## + log_tuberculousis:log_income            1    2.9058  91.948 -581.94
## + log_tuberculousis:log_NCD               1    0.4662  94.388 -571.25
## <none>                                                 94.854 -571.24
## + log_income:log_undernourishment         1    0.3720  94.482 -570.85
## + log_suicide:sqrt_education              1    0.3549  94.499 -570.77
## - log_income                              1    0.6066  95.461 -570.64
## + log_income:sqrt_education               1    0.1146  94.739 -569.74
## + log_income:log_suicide                  1    0.1061  94.748 -569.70
## + log_tuberculousis:log_undernourishment  1    0.0298  94.824 -569.37
## + log_NCD:log_undernourishment            1    0.0093  94.845 -569.28
## + log_NCD:log_income                      1    0.0077  94.846 -569.28
## + log_NCD:sqrt_education                  1    0.0009  94.853 -569.25
## - log_tuberculousis:log_suicide           1    1.0905  95.944 -568.58
## - log_NCD:log_suicide                     1    1.9225  96.776 -565.06
## - log_undernourishment:log_suicide        1    3.2360  98.090 -559.56
## - sqrt_education:log_tuberculousis        1    3.7161  98.570 -557.56
## - sqrt_education:log_undernourishment     1    9.5640 104.418 -534.05
## 
## Step:  AIC=-581.94
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_tuberculousis + log_suicide + log_income + sqrt_education:log_undernourishment + 
##     sqrt_education:log_tuberculousis + log_undernourishment:log_suicide + 
##     log_NCD:log_suicide + log_tuberculousis:log_suicide + log_tuberculousis:log_income
## 
##                                          Df Sum of Sq     RSS     AIC
## + log_tuberculousis:log_NCD               1    1.9597  89.988 -588.73
## + log_income:log_undernourishment         1    1.3363  90.612 -585.91
## + log_NCD:log_income                      1    0.5066  91.442 -582.19
## <none>                                                 91.948 -581.94
## + log_tuberculousis:log_undernourishment  1    0.4121  91.536 -581.77
## + log_suicide:sqrt_education              1    0.2821  91.666 -581.19
## + log_income:log_suicide                  1    0.1527  91.795 -580.62
## - sqrt_education:log_tuberculousis        1    0.8553  92.803 -580.16
## + log_income:sqrt_education               1    0.0294  91.919 -580.07
## + log_NCD:log_undernourishment            1    0.0130  91.935 -580.00
## + log_NCD:sqrt_education                  1    0.0122  91.936 -579.99
## - log_tuberculousis:log_suicide           1    1.7881  93.736 -576.08
## - log_NCD:log_suicide                     1    2.2577  94.206 -574.04
## - log_tuberculousis:log_income            1    2.9058  94.854 -571.24
## - log_undernourishment:log_suicide        1    3.3724  95.320 -569.24
## - sqrt_education:log_undernourishment     1    9.5632 101.511 -543.57
## 
## Step:  AIC=-588.73
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_tuberculousis + log_suicide + log_income + sqrt_education:log_undernourishment + 
##     sqrt_education:log_tuberculousis + log_undernourishment:log_suicide + 
##     log_NCD:log_suicide + log_tuberculousis:log_suicide + log_tuberculousis:log_income + 
##     log_NCD:log_tuberculousis
## 
##                                          Df Sum of Sq    RSS     AIC
## + log_income:log_undernourishment         1    1.0203 88.968 -591.38
## + log_income:log_suicide                  1    0.4932 89.495 -588.97
## <none>                                                89.988 -588.73
## + log_tuberculousis:log_undernourishment  1    0.2872 89.701 -588.03
## + log_suicide:sqrt_education              1    0.1456 89.843 -587.39
## + log_NCD:log_income                      1    0.1111 89.877 -587.23
## + log_NCD:log_undernourishment            1    0.0649 89.923 -587.02
## + log_NCD:sqrt_education                  1    0.0190 89.969 -586.81
## + log_income:sqrt_education               1    0.0003 89.988 -586.73
## - log_NCD:log_suicide                     1    1.6067 91.595 -583.51
## - sqrt_education:log_tuberculousis        1    1.7712 91.760 -582.77
## - log_NCD:log_tuberculousis               1    1.9597 91.948 -581.94
## - log_tuberculousis:log_suicide           1    2.6337 92.622 -578.96
## - log_undernourishment:log_suicide        1    3.3644 93.353 -575.75
## - log_tuberculousis:log_income            1    4.3994 94.388 -571.25
## - sqrt_education:log_undernourishment     1    9.0513 99.040 -551.62
## 
## Step:  AIC=-591.38
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_tuberculousis + log_suicide + log_income + sqrt_education:log_undernourishment + 
##     sqrt_education:log_tuberculousis + log_undernourishment:log_suicide + 
##     log_NCD:log_suicide + log_tuberculousis:log_suicide + log_tuberculousis:log_income + 
##     log_NCD:log_tuberculousis + log_undernourishment:log_income
## 
##                                          Df Sum of Sq    RSS     AIC
## + log_income:log_suicide                  1    0.5554 88.413 -591.94
## <none>                                                88.968 -591.38
## + log_income:sqrt_education               1    0.3547 88.613 -591.01
## + log_tuberculousis:log_undernourishment  1    0.2331 88.735 -590.45
## + log_suicide:sqrt_education              1    0.1809 88.787 -590.21
## + log_NCD:log_undernourishment            1    0.0846 88.883 -589.77
## + log_NCD:sqrt_education                  1    0.0174 88.951 -589.46
## + log_NCD:log_income                      1    0.0057 88.962 -589.41
## - log_undernourishment:log_income         1    1.0203 89.988 -588.73
## - sqrt_education:log_tuberculousis        1    1.2640 90.232 -587.62
## - log_NCD:log_tuberculousis               1    1.6438 90.612 -585.91
## - log_NCD:log_suicide                     1    1.6563 90.624 -585.85
## - log_tuberculousis:log_suicide           1    2.1914 91.159 -583.45
## - log_undernourishment:log_suicide        1    3.9055 92.874 -575.85
## - log_tuberculousis:log_income            1    5.1878 94.156 -570.26
## - sqrt_education:log_undernourishment     1    9.3941 98.362 -552.43
## 
## Step:  AIC=-591.94
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_tuberculousis + log_suicide + log_income + sqrt_education:log_undernourishment + 
##     sqrt_education:log_tuberculousis + log_undernourishment:log_suicide + 
##     log_NCD:log_suicide + log_tuberculousis:log_suicide + log_tuberculousis:log_income + 
##     log_NCD:log_tuberculousis + log_undernourishment:log_income + 
##     log_suicide:log_income
## 
##                                          Df Sum of Sq    RSS     AIC
## + log_income:sqrt_education               1    0.7674 87.645 -593.49
## + log_tuberculousis:log_undernourishment  1    0.6398 87.773 -592.90
## <none>                                                88.413 -591.94
## - log_suicide:log_income                  1    0.5554 88.968 -591.38
## + log_suicide:sqrt_education              1    0.2884 88.124 -591.27
## + log_NCD:log_undernourishment            1    0.0761 88.336 -590.29
## + log_NCD:sqrt_education                  1    0.0522 88.360 -590.18
## + log_NCD:log_income                      1    0.0355 88.377 -590.10
## - sqrt_education:log_tuberculousis        1    0.8826 89.295 -589.88
## - log_undernourishment:log_income         1    1.0825 89.495 -588.97
## - log_NCD:log_suicide                     1    1.3099 89.722 -587.93
## - log_NCD:log_tuberculousis               1    1.9828 90.395 -584.89
## - log_tuberculousis:log_suicide           1    2.5905 91.003 -582.15
## - log_undernourishment:log_suicide        1    4.4591 92.872 -573.86
## - log_tuberculousis:log_income            1    5.6298 94.042 -568.75
## - sqrt_education:log_undernourishment     1    9.8874 98.300 -550.68
## 
## Step:  AIC=-593.49
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_tuberculousis + log_suicide + log_income + sqrt_education:log_undernourishment + 
##     sqrt_education:log_tuberculousis + log_undernourishment:log_suicide + 
##     log_NCD:log_suicide + log_tuberculousis:log_suicide + log_tuberculousis:log_income + 
##     log_NCD:log_tuberculousis + log_undernourishment:log_income + 
##     log_suicide:log_income + sqrt_education:log_income
## 
##                                          Df Sum of Sq    RSS     AIC
## - sqrt_education:log_tuberculousis        1    0.3961 88.041 -593.65
## <none>                                                87.645 -593.49
## + log_suicide:sqrt_education              1    0.3707 87.274 -593.22
## + log_tuberculousis:log_undernourishment  1    0.3158 87.329 -592.96
## - sqrt_education:log_income               1    0.7674 88.413 -591.94
## + log_NCD:log_income                      1    0.0330 87.612 -591.65
## + log_NCD:log_undernourishment            1    0.0106 87.635 -591.54
## + log_NCD:sqrt_education                  1    0.0089 87.636 -591.53
## - log_suicide:log_income                  1    0.9681 88.613 -591.01
## - log_NCD:log_suicide                     1    1.0548 88.700 -590.61
## - log_NCD:log_tuberculousis               1    1.6562 89.301 -587.85
## - log_undernourishment:log_income         1    1.8094 89.455 -587.15
## - log_tuberculousis:log_suicide           1    2.7085 90.354 -583.07
## - log_undernourishment:log_suicide        1    4.7572 92.402 -573.93
## - log_tuberculousis:log_income            1    6.0929 93.738 -568.07
## - sqrt_education:log_undernourishment     1    6.7236 94.369 -565.34
## 
## Step:  AIC=-593.65
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_tuberculousis + log_suicide + log_income + sqrt_education:log_undernourishment + 
##     log_undernourishment:log_suicide + log_NCD:log_suicide + 
##     log_tuberculousis:log_suicide + log_tuberculousis:log_income + 
##     log_NCD:log_tuberculousis + log_undernourishment:log_income + 
##     log_suicide:log_income + sqrt_education:log_income
## 
##                                          Df Sum of Sq    RSS     AIC
## + log_suicide:sqrt_education              1    0.4848 87.557 -593.90
## <none>                                                88.041 -593.65
## + log_tuberculousis:sqrt_education        1    0.3961 87.645 -593.49
## + log_NCD:log_income                      1    0.1028 87.939 -592.13
## + log_tuberculousis:log_undernourishment  1    0.0269 88.014 -591.78
## + log_NCD:sqrt_education                  1    0.0035 88.038 -591.67
## + log_NCD:log_undernourishment            1    0.0003 88.041 -591.65
## - log_NCD:log_suicide                     1    1.1515 89.193 -590.35
## - sqrt_education:log_income               1    1.2540 89.295 -589.88
## - log_NCD:log_tuberculousis               1    1.3234 89.365 -589.56
## - log_suicide:log_income                  1    1.4823 89.524 -588.84
## - log_undernourishment:log_income         1    2.6273 90.669 -583.65
## - log_tuberculousis:log_suicide           1    2.9166 90.958 -582.36
## - log_undernourishment:log_suicide        1    5.6335 93.675 -570.35
## - sqrt_education:log_undernourishment     1    7.8362 95.877 -560.86
## - log_tuberculousis:log_income            1    8.7134 96.755 -557.15
## 
## Step:  AIC=-593.9
## life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
##     log_tuberculousis + log_suicide + log_income + sqrt_education:log_undernourishment + 
##     log_undernourishment:log_suicide + log_NCD:log_suicide + 
##     log_tuberculousis:log_suicide + log_tuberculousis:log_income + 
##     log_NCD:log_tuberculousis + log_undernourishment:log_income + 
##     log_suicide:log_income + sqrt_education:log_income + sqrt_education:log_suicide
## 
##                                          Df Sum of Sq    RSS     AIC
## <none>                                                87.557 -593.90
## - sqrt_education:log_suicide              1    0.4848 88.041 -593.65
## + log_tuberculousis:sqrt_education        1    0.2821 87.274 -593.22
## + log_NCD:log_income                      1    0.1039 87.453 -592.39
## + log_NCD:sqrt_education                  1    0.0771 87.479 -592.26
## + log_tuberculousis:log_undernourishment  1    0.0458 87.511 -592.12
## + log_NCD:log_undernourishment            1    0.0048 87.552 -591.93
## - log_NCD:log_tuberculousis               1    1.2451 88.802 -590.14
## - sqrt_education:log_income               1    1.3082 88.865 -589.85
## - log_NCD:log_suicide                     1    1.4632 89.020 -589.14
## - log_suicide:log_income                  1    1.6721 89.229 -588.19
## - log_undernourishment:log_suicide        1    2.1506 89.707 -586.00
## - log_undernourishment:log_income         1    2.7111 90.268 -583.46
## - log_tuberculousis:log_suicide           1    2.8991 90.456 -582.61
## - sqrt_education:log_undernourishment     1    7.5734 95.130 -562.06
## - log_tuberculousis:log_income            1    8.5091 96.066 -558.06
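
The AIC values in the trace above can be checked by hand. For a linear model, the step procedures in R report AIC up to an additive constant, using the formula below, where n is the number of observations and k is the number of estimated coefficients including the intercept. The check assumes n = 408. That value is inferred from the final model summary (400 residual degrees of freedom plus 8 coefficients) and is not stated directly in the report. The last step has 6 main effects, 10 interactions, and an intercept, so k = 17.

```latex
\begin{align*}
\mathrm{AIC} &= n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k \\
             &= 408 \ln\!\left(\frac{87.557}{408}\right) + 2 \cdot 17 \\
             &\approx -627.90 + 34 = -593.90
\end{align*}
```

This agrees with the value reported for the last step, which supports the assumed sample size.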

Further, we draw interaction plots for the interaction terms in the minimum-AIC model, as a visual check on the interactions selected by stepAIC.

library(interactions)

lm_refit1=lm(life_expectancy ~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+sqrt_education:log_undernourishment, data = new_data)
interact_plot(lm_refit1, pred=sqrt_education, modx=log_undernourishment)
## Warning: 0.862584248762756 is outside the observed range of log_undernourishment

lm_refit2=lm(life_expectancy ~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_income:log_tuberculousis, data = new_data)
interact_plot(lm_refit2, pred=log_income, modx=log_tuberculousis)

lm_refit3=lm(life_expectancy ~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_suicide:log_undernourishment, data = new_data)
interact_plot(lm_refit3, pred=log_suicide, modx=log_undernourishment)
## Warning: 0.862584248762756 is outside the observed range of log_undernourishment

lm_refit4=lm(life_expectancy~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+log_tuberculousis:log_suicide, data = new_data)
interact_plot(lm_refit4, pred=log_tuberculousis, modx=log_suicide)

lm_refit5=lm(life_expectancy~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+log_tuberculousis:log_income, data = new_data)
interact_plot(lm_refit5, pred=log_tuberculousis, modx=log_income)

lm_refit6=lm(life_expectancy~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+log_NCD:log_tuberculousis, data = new_data)
interact_plot(lm_refit6, pred=log_NCD, modx=log_tuberculousis)

lm_refit7=lm(life_expectancy~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+log_undernourishment:log_income, data = new_data)
interact_plot(lm_refit7, pred=log_undernourishment, modx=log_income)

lm_refit8=lm(life_expectancy~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+log_suicide:log_income, data = new_data)
interact_plot(lm_refit8, pred=log_suicide, modx=log_income)

lm_refit9=lm(life_expectancy~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+sqrt_education:log_income, data = new_data)
interact_plot(lm_refit9, pred=sqrt_education, modx=log_income)

lm_refit10=lm(life_expectancy~ log_tuberculousis + log_NCD + log_income + log_suicide + sqrt_education + log_undernourishment+sqrt_education:log_suicide, data = new_data)
interact_plot(lm_refit10, pred=sqrt_education, modx=log_suicide)

We observe that the lines are not parallel in the interaction plots for sqrt(education) and log(undernourishment), log(undernourishment) and log(suicide), log(NCD) and log(suicide), log(tuberculousis) and log(suicide), log(tuberculousis) and log(income), log(NCD) and log(tuberculousis), log(undernourishment) and log(income), log(suicide) and log(income), sqrt(education) and log(income), and sqrt(education) and log(suicide). Non-parallel lines suggest that the effect of one variable depends on the level of the other, so we carry these ten interactions forward into our multiple linear regression model.

VIF Values

We use variance inflation factor (VIF) values to measure how strongly each term is correlated with the other independent variables in the regression. This helps us avoid multicollinearity, which inflates the variance of the coefficient estimates and increases the risk of type II errors.
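
For reference, the usual textbook definition of the VIF for a single term is sketched below. The symbol R_j^2 is introduced here and does not appear in the report: it denotes the R-squared obtained when term j is regressed on all the other terms in the model.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

Under this definition, a VIF of 10 corresponds to R_j^2 = 0.9, meaning that 90% of the variation in term j is explained by the other terms.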

f_lm1=lm(life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
    log_suicide + log_income + sqrt_education:log_undernourishment + 
    log_undernourishment:log_suicide + log_NCD:log_suicide + 
    log_tuberculousis:log_suicide + log_tuberculousis:log_income + 
    log_NCD:log_tuberculousis + log_undernourishment:log_income + 
    log_suicide:log_income + sqrt_education:log_income + sqrt_education:log_suicide, data = new_data)  

vif(f_lm1)
##                             log_NCD                      sqrt_education 
##                           21.414871                          303.185493 
##                log_undernourishment                         log_suicide 
##                          244.869351                          829.862161 
##                          log_income sqrt_education:log_undernourishment 
##                           77.857803                            7.959714 
##    log_undernourishment:log_suicide                 log_NCD:log_suicide 
##                           54.516802                          699.145664 
##       log_suicide:log_tuberculousis        log_income:log_tuberculousis 
##                           21.659061                           49.025468 
##           log_NCD:log_tuberculousis     log_undernourishment:log_income 
##                           69.211869                          141.275009 
##              log_suicide:log_income           sqrt_education:log_income 
##                           97.059594                          318.407955 
##          sqrt_education:log_suicide 
##                           63.301258
f_lm2=lm(life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
     log_income + sqrt_education:log_undernourishment + 
    log_undernourishment:log_suicide + log_NCD:log_suicide + 
    log_tuberculousis:log_suicide + log_tuberculousis:log_income + 
    log_NCD:log_tuberculousis + log_undernourishment:log_income + 
    log_suicide:log_income + sqrt_education:log_income + sqrt_education:log_suicide, data = new_data)  

vif(f_lm2)
##                             log_NCD                      sqrt_education 
##                            6.705396                          269.009692 
##                log_undernourishment                          log_income 
##                          241.248338                           69.908005 
## sqrt_education:log_undernourishment    log_undernourishment:log_suicide 
##                            7.947199                           53.760328 
##                 log_NCD:log_suicide       log_suicide:log_tuberculousis 
##                          126.264056                           21.524340 
##        log_income:log_tuberculousis           log_NCD:log_tuberculousis 
##                           48.713482                           68.309443 
##     log_undernourishment:log_income              log_income:log_suicide 
##                          140.201731                           77.497715 
##           sqrt_education:log_income          sqrt_education:log_suicide 
##                          303.288666                           53.448179
f_lm3=lm(life_expectancy ~ log_NCD + sqrt_education + log_undernourishment + 
     log_income + sqrt_education:log_undernourishment + 
    log_undernourishment:log_suicide + log_NCD:log_suicide + 
    log_tuberculousis:log_suicide + log_tuberculousis:log_income + 
    log_NCD:log_tuberculousis + log_undernourishment:log_income + 
    log_suicide:log_income + sqrt_education:log_suicide, data = new_data)  

vif(f_lm3)
##                             log_NCD                      sqrt_education 
##                            6.499978                           43.231083 
##                log_undernourishment                          log_income 
##                          171.678544                           25.238230 
## sqrt_education:log_undernourishment    log_undernourishment:log_suicide 
##                            6.016976                           53.729823 
##                 log_NCD:log_suicide       log_suicide:log_tuberculousis 
##                          124.463091                           21.483681 
##        log_income:log_tuberculousis           log_NCD:log_tuberculousis 
##                           48.646500                           68.297705 
##     log_undernourishment:log_income              log_income:log_suicide 
##                          109.326006                           74.271945 
##          sqrt_education:log_suicide 
##                           52.492097
f_lm4=lm(life_expectancy ~ log_NCD + sqrt_education  + 
     log_income + sqrt_education:log_undernourishment + 
    log_undernourishment:log_suicide + log_NCD:log_suicide + 
    log_tuberculousis:log_suicide + log_tuberculousis:log_income + 
    log_NCD:log_tuberculousis + log_undernourishment:log_income + 
    log_suicide:log_income + sqrt_education:log_suicide, data = new_data)  

vif(f_lm4)
##                             log_NCD                      sqrt_education 
##                            6.361508                           42.842948 
##                          log_income sqrt_education:log_undernourishment 
##                           15.334013                            5.648248 
##    log_undernourishment:log_suicide                 log_NCD:log_suicide 
##                           39.046183                          116.204892 
##       log_suicide:log_tuberculousis        log_income:log_tuberculousis 
##                           21.330630                           45.316191 
##           log_NCD:log_tuberculousis     log_income:log_undernourishment 
##                           63.348578                           26.964444 
##              log_income:log_suicide          sqrt_education:log_suicide 
##                           74.082741                           49.475127
f_lm5=lm(life_expectancy ~ log_NCD + sqrt_education  + 
     log_income + sqrt_education:log_undernourishment + 
    log_undernourishment:log_suicide + 
    log_tuberculousis:log_suicide + log_tuberculousis:log_income + 
    log_NCD:log_tuberculousis + log_undernourishment:log_income + 
    log_suicide:log_income + sqrt_education:log_suicide, data = new_data)  

vif(f_lm5)
##                             log_NCD                      sqrt_education 
##                            2.274173                           39.277067 
##                          log_income sqrt_education:log_undernourishment 
##                           14.265740                            5.550315 
##    log_undernourishment:log_suicide       log_suicide:log_tuberculousis 
##                           21.477249                           17.190999 
##        log_income:log_tuberculousis           log_NCD:log_tuberculousis 
##                           44.045708                           63.263552 
##     log_income:log_undernourishment              log_income:log_suicide 
##                           19.032463                           36.644697 
##          sqrt_education:log_suicide 
##                           45.462863
f_lm6=lm(life_expectancy ~ log_NCD + sqrt_education  + 
     log_income + sqrt_education:log_undernourishment + 
    log_undernourishment:log_suicide + 
    log_tuberculousis:log_suicide + log_tuberculousis:log_income + 
    log_undernourishment:log_income + 
    log_suicide:log_income + sqrt_education:log_suicide, data = new_data)  

vif(f_lm6)
##                             log_NCD                      sqrt_education 
##                            1.981848                           38.464524 
##                          log_income sqrt_education:log_undernourishment 
##                           13.322987                            5.536814 
##    log_undernourishment:log_suicide       log_suicide:log_tuberculousis 
##                           21.015437                           15.818216 
##        log_income:log_tuberculousis     log_income:log_undernourishment 
##                            7.820429                           18.960750 
##              log_income:log_suicide          sqrt_education:log_suicide 
##                           34.966575                           44.890353
f_lm7=lm(life_expectancy ~ log_NCD + sqrt_education  + 
     log_income + sqrt_education:log_undernourishment + 
    log_undernourishment:log_suicide + 
    log_tuberculousis:log_suicide + log_tuberculousis:log_income + 
    log_undernourishment:log_income + 
    log_suicide:log_income , data = new_data)  

vif(f_lm7)
##                             log_NCD                      sqrt_education 
##                            1.971445                           11.159202 
##                          log_income sqrt_education:log_undernourishment 
##                            8.014695                            5.517505 
##    log_undernourishment:log_suicide       log_suicide:log_tuberculousis 
##                           14.494598                           15.600938 
##        log_income:log_tuberculousis     log_income:log_undernourishment 
##                            7.761027                           16.060096 
##              log_income:log_suicide 
##                            6.928652
f_lm8=lm(life_expectancy ~ log_NCD + sqrt_education  + 
     log_income + sqrt_education:log_undernourishment + 
    log_undernourishment:log_suicide + 
    log_tuberculousis:log_suicide + log_tuberculousis:log_income + 
    log_suicide:log_income , data = new_data)  

vif(f_lm8)
##                             log_NCD                      sqrt_education 
##                            1.970172                            5.792327 
##                          log_income sqrt_education:log_undernourishment 
##                            4.910514                            2.289050 
##    log_undernourishment:log_suicide       log_suicide:log_tuberculousis 
##                            6.727141                           15.402695 
##        log_income:log_tuberculousis              log_income:log_suicide 
##                            7.542750                            5.683371
f_lm9=lm(life_expectancy ~ log_NCD + sqrt_education  + 
     log_income + sqrt_education:log_undernourishment + 
    log_undernourishment:log_suicide + 
    log_tuberculousis:log_income + 
    log_suicide:log_income , data = new_data)  

vif(f_lm9)
##                             log_NCD                      sqrt_education 
##                            1.957256                            5.592539 
##                          log_income sqrt_education:log_undernourishment 
##                            3.201881                            2.117339 
##    log_undernourishment:log_suicide        log_income:log_tuberculousis 
##                            5.600148                            1.118468 
##              log_income:log_suicide 
##                            3.406178

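Starting from f_lm1, we remove the term with the largest VIF value one at a time and refit the model until every remaining term has a VIF value below 10. Only these variables and interaction terms are kept in our final linear regression model.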

final_lm = lm(life_expectancy ~ log_NCD + sqrt_education  + 
    log_income + sqrt_education:log_undernourishment + 
    log_undernourishment:log_suicide + 
    log_tuberculousis:log_income + 
    log_suicide:log_income , data = new_data)  
summary(final_lm)
## 
## Call:
## lm(formula = life_expectancy ~ log_NCD + sqrt_education + log_income + 
##     sqrt_education:log_undernourishment + log_undernourishment:log_suicide + 
##     log_tuberculousis:log_income + log_suicide:log_income, data = new_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.77881 -0.31544  0.02725  0.33786  1.92443 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         62.907978   0.841728  74.737  < 2e-16 ***
## log_NCD                             -6.940823   0.112092 -61.921  < 2e-16 ***
## sqrt_education                       0.081917   0.025634   3.196  0.00151 ** 
## log_income                          -0.005242   0.032256  -0.163  0.87097    
## sqrt_education:log_undernourishment  0.084678   0.011754   7.204 2.92e-12 ***
## log_undernourishment:log_suicide    -0.286667   0.024175 -11.858  < 2e-16 ***
## log_income:log_tuberculousis        -0.012149   0.002478  -4.903 1.38e-06 ***
## log_income:log_suicide               0.035776   0.006310   5.670 2.74e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.523 on 400 degrees of freedom
## Multiple R-squared:  0.9693, Adjusted R-squared:  0.9688 
## F-statistic:  1805 on 7 and 400 DF,  p-value: < 2.2e-16
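
The manual elimination above (f_lm1 to f_lm9) could also be written as a loop. The following is a minimal base-R sketch under stated assumptions, not the code used in this report. The data frame toy and the helper names vif_by_hand and prune_by_vif are hypothetical, and the VIF is computed from its textbook definition instead of calling vif() from the car package.

```r
# Hypothetical toy data: x2 is nearly a copy of x1, so both have large VIFs.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
x3 <- rnorm(n)
toy <- data.frame(y = 1 + 2 * x1 - x3 + rnorm(n), x1 = x1, x2 = x2, x3 = x3)

# VIF of each design-matrix column: 1 / (1 - R^2_j), where R^2_j comes from
# regressing column j on all the other columns.
vif_by_hand <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]  # drop the intercept column
  if (ncol(X) < 2) return(setNames(1, colnames(X)))
  sapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared
    1 / (1 - r2)
  })
}

# Drop the term with the largest VIF and refit until all VIFs are below the
# threshold. Assumes numeric predictors, so column names equal term labels.
prune_by_vif <- function(fml, data, threshold = 10) {
  model <- lm(fml, data = data)
  repeat {
    v <- vif_by_hand(model)
    if (length(v) < 2 || max(v) < threshold) return(model)
    worst <- names(v)[which.max(v)]
    fml   <- update(formula(model), as.formula(paste(". ~ . -", worst)))
    model <- lm(fml, data = data)
  }
}

pruned <- prune_by_vif(y ~ x1 + x2 + x3, data = toy)
print(vif_by_hand(pruned))
```

On the toy data, one of the two nearly identical predictors is removed and the remaining VIF values fall well below 10. Dropping terms purely by VIF can remove a main effect while keeping its interactions, as happens in the report, so the result should still be reviewed by hand.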

Discussion Section

(1) Analysis of coefficients

From the summary table, we can analyse the coefficients which we are interested in:

Holding everything else constant, a one-unit increase in the log of the number of NCD deaths is associated with an expected decrease of about 6.94 years in life expectancy at age 60 (the log_NCD coefficient is -6.940823). This can be interpreted as the effect of the log of the number of NCD deaths on life expectancy, controlling for the other six terms in the model.

Holding everything else constant, a one-unit increase in the square root of the tertiary school enrollment rate is associated with an expected increase of about 0.082 years in life expectancy when log(undernourishment) is zero (the sqrt_education coefficient is 0.081917). Because sqrt(education) also appears in an interaction with log(undernourishment), its overall effect depends on the level of undernourishment.

(2) Interaction Effects

The coefficient of the square root of School enrollment, tertiary (% gross) increases by 0.084678 for every one-unit increase in the log of the prevalence of undernourishment.

The coefficient of the log of the prevalence of undernourishment decreases by 0.286667 for every one-unit increase in the log of the age-standardized suicide rate (per 100 000 population).

The coefficient of the log of the number of deaths due to tuberculosis (excluding HIV) decreases by 0.012149 for every one-unit increase in the log of per adult national income.

The coefficient of the log of the age-standardized suicide rate (per 100 000 population) increases by 0.035776 for every one-unit increase in the log of per adult national income.
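
These statements can be read off the fitted equation. As a sketch, write the fitted model with coefficients rounded from the summary table, using shorthand symbols that are introduced here and are not part of the report: N = log_NCD, E = sqrt_education, I = log_income, U = log_undernourishment, S = log_suicide, T = log_tuberculousis. Differentiating with respect to one variable gives its slope as a function of the others.

```latex
\begin{align*}
\widehat{\mathrm{LE}} &= 62.908 - 6.941\,N + 0.0819\,E - 0.0052\,I
                         + 0.0847\,E U - 0.2867\,U S - 0.0121\,I T + 0.0358\,I S \\
\frac{\partial \widehat{\mathrm{LE}}}{\partial E} &= 0.0819 + 0.0847\,U \\
\frac{\partial \widehat{\mathrm{LE}}}{\partial U} &= 0.0847\,E - 0.2867\,S \\
\frac{\partial \widehat{\mathrm{LE}}}{\partial I} &= -0.0052 - 0.0121\,T + 0.0358\,S \\
\frac{\partial \widehat{\mathrm{LE}}}{\partial S} &= -0.2867\,U + 0.0358\,I
\end{align*}
```

For example, the slope of life expectancy in sqrt(education) rises by 0.0847 for each one-unit increase in log(undernourishment), which is the first interaction effect described above.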

(3) VIF Values

By checking the VIF values, we keep only log(NCD), sqrt(education), log(income), and the interaction terms of sqrt(education) and log(undernourishment), log(undernourishment) and log(suicide), log(tuberculousis) and log(income), and log(suicide) and log(income), all of which have VIF values below 10. This indicates that none of the retained variables and interaction terms is highly correlated with the others.

(4) P-values

The p-values of the log of NCD, the square root of education, the interaction term between the square root of education and the log of undernourishment, the interaction term between the log of undernourishment and the log of suicide, the interaction term between the log of tuberculosis and the log of income, and the interaction term between the log of suicide and the log of income are all less than 0.05, which implies that these terms have statistically significant effects on the response variable. The main effect of the log of income has a p-value of 0.871 and is not significant on its own.

(5) Limitations

One limitation is missing data in the education variable (School enrollment, tertiary (% gross)) and negative values in the income variable (Per adult national income). These values are a problem because, after we apply the non-linear transformations log() to income and sqrt() to education, the dataset contains a large number of NaN and -Inf values.

To address this problem, we tried to approximate the missing education data by interpolation. The idea is that countries with missing data in the four years we focus on still have data available in other years. If we could find a general trend in how education values change over the years, we could estimate the missing values. However, after plotting the known education values of twenty randomly selected countries, and repeating the process three times, we failed to find a suitable function for the interpolation.

Below are three plots of known education values:

library(tidyverse)
library(readxl)

school = read_xls("schooling.xls",sheet=2)

school %>% pivot_longer(`1960`:last_col(),names_to="year",values_to="val") %>% 
  # year arrives as text from the column names; compare it as a number
  mutate(year = as.numeric(year)) %>% 
  group_by(`Country Name`) %>% 
  # count the non-missing enrollment values available in each period
  summarise(pre2000 = sum(year<2000 & !is.na(val)),
            btw2000.2010 = sum(year>2000 & year<2010 & !is.na(val)),
            btw2010.2015 = sum(year>2010 & year<2015 & !is.na(val)),
            post2015 = sum(year>2015 & !is.na(val))) -> school.ys

par(mfrow = c(3,1))

school %>% 
  pivot_longer(`1960`:last_col(),names_to="year",values_to="val") %>% 
  filter(`Country Name` %in% sample(unique(`Country Name`),20)) %>%
  ggplot(aes(x=as.numeric(year),y=val,color=`Country Name`))+geom_point()+geom_line()
## Warning: Removed 696 rows containing missing values (geom_point).
## Warning: Removed 582 row(s) containing missing values (geom_path).

school %>% 
  pivot_longer(`1960`:last_col(),names_to="year",values_to="val") %>% 
  filter(`Country Name` %in% sample(unique(`Country Name`),20)) %>%
  ggplot(aes(x=as.numeric(year),y=val,color=`Country Name`))+geom_point()+geom_line()
## Warning: Removed 583 rows containing missing values (geom_point).
## Warning: Removed 447 row(s) containing missing values (geom_path).

school %>% 
  pivot_longer(`1960`:last_col(),names_to="year",values_to="val") %>% 
  filter(`Country Name` %in% sample(unique(`Country Name`),20)) %>%
  ggplot(aes(x=as.numeric(year),y=val,color=`Country Name`))+geom_point()+geom_line()
## Warning: Removed 734 rows containing missing values (geom_point).
## Warning: Removed 525 row(s) containing missing values (geom_path).

We can see that these data follow no fixed trend. Some of the lines are linear, some are curved, and some are not even monotone (i.e., the school enrollment value of a country may increase first, then decrease, and then increase again). Without a good function, approximating the missing data may introduce a larger bias. For this reason, we chose to delete each country-year observation with a missing education value.
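
For illustration only, one simple candidate for the kind of per-country interpolation discussed above is linear interpolation between neighbouring known years. The sketch below uses a hypothetical enrollment series, not values from the schooling data, and it is not a method adopted in this report.

```r
# Hypothetical enrollment series for one country, with the 2010 value missing.
years      <- c(2000, 2005, 2010, 2015, 2019)
enrollment <- c(20.0, 26.0, NA, 38.0, 41.0)

known <- !is.na(enrollment)

# Linear interpolation between neighbouring known points. rule = 1 returns NA
# outside the observed range instead of extrapolating.
filled <- approx(x = years[known], y = enrollment[known],
                 xout = years, rule = 1)$y
print(data.frame(years, enrollment, filled))
```

A straight line between 2005 and 2015 fills 2010 with the midpoint of the two neighbouring values. For the non-monotone series seen in the plots, such a fill could be far from the true value, which is the concern raised above.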

In addition, since the log of a negative value is undefined, which would cause a problem in the linear regression model, we chose to delete each country-year observation with a negative income value.
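
The two deletion rules can be sketched in base R as follows. The data frame toy and its values are hypothetical, with column names chosen to mirror the variables in this report. The sketch also drops zero incomes, which is an added assumption made because log(0) is -Inf.

```r
# Hypothetical toy rows; one negative income and one missing education value.
toy <- data.frame(
  country   = c("A", "B", "C", "D"),
  income    = c(25000, -300, 12000, 8000),
  education = c(45.2, 30.1, NA, 12.5)
)

# Keep only rows where both transformations are defined and finite.
clean <- subset(toy, !is.na(education) & !is.na(income) & income > 0)

# Apply the transformations after filtering, so no NaN or -Inf is produced.
clean$log_income     <- log(clean$income)
clean$sqrt_education <- sqrt(clean$education)
print(clean)
```

Filtering before transforming avoids the warnings that log() raises on negative input and keeps the regression data free of non-finite values.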

Conclusion Section

We collected data on life expectancy at age 60 and on the six explanatory variables we are interested in. By constructing a multiple linear regression model and an ANOVA model, we find that log(NCD), sqrt(education), and the interaction terms of sqrt(education) and log(undernourishment), log(undernourishment) and log(suicide), log(tuberculousis) and log(income), and log(suicide) and log(income) have significant effects on life expectancy at age 60 (the response variable). The main effect of log(income) is retained in the final model but is not significant on its own (p = 0.871); income enters the model mainly through its interactions with tuberculosis and suicide.